Takeaways of the session

What you will find in the CCA claims database

This is an extract of car claims handled by AXA Seguros & Direct Seguros.

We focus on 2014 claims.

Each claim is identified with claim_id.

Some claims are described on several rows and have a multitask_id.

The other IDs are explicit:

The CCA aims at scoring bodyshops, in order to offer agreements to the best ones and to send each claim to the right bodyshop. For that we need to build per-bodyshop features for the scoring: use garage_id.

First, the target variables, related to cost.

The most important one, cost_total, is the cost of the repair, before tax & discount.

This cost is split into several components:
- labor cost for repair/replacement: cost_rep_lab
- labor cost for painting: cost_pt_lab
- cost of painting material: cost_pt_mat
- cost of spare parts: cost_rep_mat

These costs are complemented by:
- number of spare parts used in the repair: cost_number_sparep
- amount of the discount applied to cost_total: cost_discount
- amount of the excess (participation of the insured in the cost of the claim): cost_excess
- hourly cost of labor for repair/replacement: hcost_rep_lab
- hours of painting: time_lab_pt
- hours of labor for repair/replacement: time_lab_rep

In addition we created some dummy variables related to specific spare parts needed for the repair:
- headlight: check_repair_headlight
- door: check_repair_door
- front bumper: check_repair_frontbumper
- back bumper: check_repair_backbumper
- bonnet: check_repair_bonnet
- mudguard: check_repair_mudguard

Note that most spare parts are specific to a car brand, model & year; moreover, in the database the references and names are often wrong. Here we defined the features from very generic spare parts.

The severity of the claim is perhaps the most important feature for anticipating its cost.

Unfortunately, nothing is filled in the database prior to the expert assessment.

We know the information exists: think for instance of the location of the damage, the number of vehicles involved, the circumstances of the claim, the speed at the moment of the claim…

Here is what we could find:
- guilt of the insured: claim_guilt
- does it involve a bodily injury? claim_injuries
- type of settlement, standard or convention between insurers: claim_solving
- guarantee affected: coverage_affected_a
- type of claim: claim_type
- channel for claim opening: expert_order_opening_channel
- was the vehicle replaced during the repair? vehicle_replace
- did the bodyshop provide pick-up and delivery? vehicle_pnd

Other features can help anticipate the cost of the claim:
- sale price of the vehicle: vehicle_sale_price. A more insightful quantity is the value of the vehicle prior to the claim (adjusted for age, mileage, repairs, former claims, etc.), but we don’t have it.
- vehicle brand: vehicle_brand
- vehicle age: vehicle_age

One could also use the dates to compute: time from underwriting to claim, from claim to assessment, and from assessment to final payment of the repair or retrieval of the car.

Sometimes we want either to:
1) explain the gap between two types of claims,
2) or measure this gap.

Think for instance of the agreed bodyshops (garage_tac) or the insurance entity (company). The solutions are:
1) build a predictive model strong enough to achieve a good AUC, under the constraint of good interpretability;
2) build a model that is interpretable, as much as possible, “ceteris paribus”. If you use a GLM, use PLS to improve the consistency of the coefficients, and read the gap off the coefficients “ceteris paribus”.

In addition we also provide the geolocation of both the bodyshop that repaired the vehicle & the customer: customer_lon, customer_lat, garage_lon, garage_lat.

With these four features we can build the Euclidean distance from the bodyshop to the customer: distance_customer_garage.
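As a sketch (the function name and the kilometre conversion constants are our own; an equirectangular approximation is enough at a regional scale), such a distance could be computed as:

```r
# Approximate distance in km between a customer and a bodyshop,
# from (lon, lat) pairs in degrees. Equirectangular approximation:
# fine for intra-country distances, simpler than the haversine formula.
geo_dist_km <- function(lon1, lat1, lon2, lat2) {
  dx <- (lon2 - lon1) * cos((lat1 + lat2) * pi / 360) * 111.32  # km per degree of longitude at this latitude
  dy <- (lat2 - lat1) * 110.57                                  # km per degree of latitude
  sqrt(dx^2 + dy^2)
}

# e.g. distance_customer_garage could be
# geo_dist_km(customer_lon, customer_lat, garage_lon, garage_lat)
```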

As we know which bodyshop AXA recommended for the repair, we also compute the distance from the customer to this bodyshop: distance_customer_offered_garage.

As cost also depends on the local situation, we provide garage_postal_code, the postal code of the bodyshop, to easily compute local cost indicators.

However, we strongly recommend k-NN-based methods to build “local” cost-related features.
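A minimal base-R sketch of such a feature (the helper name is ours; on the full data a k-d tree such as FNN::get.knn, used later for the k-dist plot, would scale better): the average cost of the k nearest bodyshops.

```r
# "Local" cost feature: for each bodyshop, the mean cost over its
# k nearest neighbours in (lon, lat) space, excluding itself.
local_mean_cost <- function(lon, lat, cost, k = 10) {
  d <- as.matrix(dist(cbind(lon, lat)))  # pairwise distances (fine for a few thousand rows)
  diag(d) <- Inf                         # a bodyshop is not its own neighbour
  sapply(seq_along(cost), function(i) mean(cost[order(d[i, ])[1:k]]))
}
```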

load("claims_data_training.RData")

Unsupervised techniques

Geometry of the data & distance computation

The geometry/topology of a dataset is a rich subject. In practice the first question one faces is: how do I compare factors, characters, booleans and numerics?

The tools for putting it all together are “metrics” or “distances”.

## distance between words (see the text mining session)
* based on spelling
* based on semantics
* based on syntax

## distance between numerical vectors
An example with 2 variables: what is the distance between a 50-year-old who earns 100k€ and a 30-year-old who earns 30k€? Should we:

### Do nothing?
- scatter plot

load("full_expert_order_geolocation.RData")

illustration=full_expert_order_all[,c("vehicle_sale_price","cost_number_sparep"),with=F]
illustration=sample(na.omit(illustration))[1:1000]
illustration$tag=c("A","B","C",rep(NA,997))
ggplot(illustration,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)
## Warning: Removed 997 rows containing missing values (geom_text).

dist(illustration[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
##          1        2
## 2 1585.431         
## 3 3682.962 2097.564

This is not intuitive, is it?

- density plot

ggplot(illustration,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    geom_density2d(colour="black") +
    guides(alpha=FALSE)

### scale & center?
- scatter plot

illustration2=preProcess(illustration,method=c("center","scale"))
illustration2=predict(illustration2,illustration)
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 997 rows containing missing values (geom_text).

dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
##           1         2
## 2 1.1006297          
## 3 0.8247473 1.8263455
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    geom_density2d(colour="black") +
    guides(alpha=FALSE)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 55 rows containing non-finite values (stat_density2d).
## Warning: Removed 55 rows containing non-finite values (stat_density2d).

The scale has changed, so the distances change, but the picture looks the same.

### cap or remove extreme values
- scatter plot

illustration2=illustration2[vehicle_sale_price%between%quantile(illustration2$vehicle_sale_price,c(0.025,0.975))]

illustration2=illustration2[cost_number_sparep%between%quantile(illustration2$cost_number_sparep,c(0.025,0.975))]

ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 20 rows containing missing values (geom_point).
## Warning: Removed 929 rows containing missing values (geom_text).

dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
##           1         2
## 2 1.1006297          
## 3 0.8247473 1.8263455
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    geom_density2d(colour="black") +
    guides(alpha=FALSE)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 20 rows containing non-finite values (stat_density2d).
## Warning: Removed 20 rows containing non-finite values (stat_density2d).

### correct asymmetry (skewness) and tails of the distribution (kurtosis)
- scatter plot

temp=illustration
temp$vehicle_sale_price=log(temp$vehicle_sale_price)
illustration2=preProcess(temp,method=c("center","scale"))
illustration2=predict(illustration2,temp)
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 74 rows containing missing values (geom_point).
## Warning: Removed 997 rows containing missing values (geom_text).

dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
##           1         2
## 2 1.1008248          
## 3 0.8173244 1.8243297
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    geom_density2d(colour="black") +
    guides(alpha=FALSE)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 74 rows containing non-finite values (stat_density2d).
## Warning: Removed 74 rows containing non-finite values (stat_density2d).

### apply ranking, i.e. a margin transformation, to make the data uniform
- scatter plot

illustration2<-illustration
illustration2$vehicle_sale_price=round(rank(illustration2$vehicle_sale_price))
illustration2$vehicle_sale_price=illustration2$vehicle_sale_price/max(illustration2$vehicle_sale_price)
illustration2$cost_number_sparep=round(rank(illustration2$cost_number_sparep))
illustration2$cost_number_sparep=illustration2$cost_number_sparep/max(illustration2$cost_number_sparep)

ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)
## Warning: Removed 997 rows containing missing values (geom_text).

dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
##           1         2
## 2 0.5042787          
## 3 0.1787316 0.5538086
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    geom_density2d(colour="black") +
    guides(alpha=FALSE)

The Box-Cox power transformation provides a general framework for automatically selecting a convenient transformation to reshape the data.

It can be applied with the preProcess() function from the caret package, using method “BoxCox”.

http://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/
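For intuition, here is a minimal base-R sketch of the transformation itself (caret’s preProcess estimates the exponent lambda automatically; this hypothetical helper just applies a given lambda):

```r
# Box-Cox transform for positive data: (x^lambda - 1) / lambda,
# with the log as the limiting case when lambda tends to 0.
boxcox_transform <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# lambda = 1 leaves the shape unchanged (just a shift); lambda = 0 is the
# log used above on vehicle_sale_price; values in between interpolate.
```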

distance between unordered factors

What are the distances between car brands or claim types?

Are Audi, Toyota & Peugeot equidistant?

Contrary to numerical variables, there is no intuitive distance between the levels of a factor.

### Gower metric using the daisy function

illustration=full_expert_order_all[,c("coverage_affected_a","vehicle_brand","cost_number_sparep","garage_tac"),with=F]
illustration=sample(na.omit(illustration))[1:5]
library(cluster)
mix_metric=daisy(illustration,metric="gower")
illustration
##    vehicle_brand garage_tac    coverage_affected_a cost_number_sparep
## 1:       PEUGEOT          0 own damage with excess                  8
## 2:        NISSAN          0 own damage with excess                  2
## 3:    VOLKSWAGEN          0                    TPL                 12
## 4:       PEUGEOT          0                    TPL                  3
## 5:          OPEL          0                    TPL                 22
mix_metric
## Dissimilarities :
##        1      2      3      4
## 2 0.3250                     
## 3 0.5500 0.6250              
## 4 0.3125 0.5125 0.3625       
## 5 0.6750 0.7500 0.3750 0.4875
## 
## Metric :  mixed ;  Types = N, I, N, I 
## Number of objects : 5
  • some methods, such as MCA, are based on a dissimilarity matrix, i.e. hard-coding each factor as n-1 dummy variables for an n-level factor.
  • another idea is to create numerical features using the factors as pivots, for instance: average value of the cars per brand, average cost of repair per brand, orientation per brand… anything we would like to test.
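As a sketch of the pivot idea (the toy data and the column name brand_avg_price are ours), in the data.table idiom used throughout this session:

```r
library(data.table)

# Replace the factor by numerical summaries computed per level:
# here, the average sale price of the cars of each brand.
dt <- data.table(vehicle_brand      = c("AUDI", "AUDI", "FIAT", "FIAT", "FIAT"),
                 vehicle_sale_price = c(40, 44, 12, 14, 10))
dt[, brand_avg_price := mean(vehicle_sale_price), by = vehicle_brand]
```

The new column is numeric, so the usual Euclidean-style distances apply again.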

Methods for clustering

load data for toy examples (same data as for supervised)

load("toy_datasets.RData")
linear=rbind(linear_test,linear_train)
saturn=rbind(saturn_test,saturn_train)
moon=rbind(moon_test,moon_train)

k-means

theory

k-means (look at the last example, which opens the door to density-based algorithms): http://www.onmyphd.com/?p=k-means.clustering&ckattempt=1

k-medoids (more sophisticated than k-means, but much slower): https://www.youtube.com/watch?v=osjd5eV4ypA

Not many parameters.

When used for classification, k is tuned to minimize the error rate; the train & test split then becomes important, since calibration is done against a target.

Otherwise, the parameter is tuned to match intuition or visual inspection (see geo-clustering).
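Absent a target, a common heuristic is the “elbow”: plot the total within-cluster sum of squares against k and keep the k where the curve bends. A self-contained sketch on simulated data (base R only; the data are ours):

```r
set.seed(42)
# two well-separated blobs of 50 points each
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)

plot(1:6, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
# the sharp drop from k = 1 to k = 2, followed by a flat tail, points at k = 2
```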

application on a toy example

linear$kmeans=kmeans(linear[,c("X1","X2"),with=F],centers = 2)$cluster
ggplot(data=linear,aes(x=X1,y=X2,color=factor(kmeans),shape=factor(Target)))+geom_point()

k=5
moon$kmeans=kmeans(moon[,c("X1","X2"),with=F],centers = k)$cluster
moon[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=kmeans]
##    kmeans     group1     group2 volume
## 1:      3 0.02719033 0.97280967    662
## 2:      2 0.95731707 0.04268293    492
## 3:      5 0.06073446 0.93926554    708
## 4:      1 0.96288660 0.03711340    485
## 5:      4 0.77182236 0.22817764    653
ggplot(data=moon,aes(x=X1,y=X2,color=factor(kmeans),shape=factor(Target)))+geom_point(size=3)

k=8
saturn$kmeans=kmeans(saturn[,c("X1","X2"),with=F],centers = k)$cluster
saturn[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=kmeans]
##    kmeans group1 group2 volume
## 1:      7      0      1    284
## 2:      8      1      0     44
## 3:      2      1      0     49
## 4:      4      1      0     38
## 5:      6      1      0     39
## 6:      3      1      0     44
## 7:      5      1      0     50
## 8:      1      1      0     52
ggplot(data=saturn,aes(x=X1,y=X2,color=factor(kmeans),shape=factor(Target)))+geom_point()

Load the map backend using the Google API

g<-qmap("Spain",maptype="roadmap",zoom=7)
save(list="g",file="map_spain.RData")

Application to geoclustering

load("map_spain.RData")
load("claims_data_training.RData")
claims[,c("garage_volume"):=list(.N),by="garage_id"]
location=unique(claims[garage_volume>10][,c("garage_id","garage_lon","garage_lat"),with=F])
cluster_kmeans=kmeans(location[,c("garage_lon","garage_lat"),with=F],centers = 10)
location$cluster=factor(cluster_kmeans$cluster)

hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]

g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA) 
## Warning: Removed 2581 rows containing missing values (geom_point).

Lots of outliers. Is it a problem? Or does it mean that outliers are isolated bodyshops that can’t be substituted?

density based clustering - DBSCAN (& OPTICS)

theory

visual example & illustration of the method

http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf

Define the epsilon radius & the minimum number of points within the radius.

If a point doesn’t have enough points in its neighborhood, it is considered an outlier.

DBSCAN reaches its limits when the local density is heterogeneous (slides 15/16).

It sounds consistent because it is based on the density of points, and it fits the visual interpretation of “bags” of points.

It is possible to tune it smartly with the k-nearest-neighbor distance, where k is the “minPts” parameter.

This is done in the subsection on PCA & outlier removal on the claims database (PCA+DBSCAN).

For one DBSCAN run, a single epsilon (radius) is used, but the clustering can be iterated:
1) define a list of epsilons for which you want to create clusters (the k-NN distance distribution & its quantiles can be insightful);
2) start with the smallest epsilon;
3) cluster 0 stands for “not clustered” (outliers): run the next DBSCAN (next epsilon) on these points.

Step 3 is repeated with each epsilon, from the smallest to the largest.

application on a toy example

load("toy_datasets.RData")
linear=rbind(linear_test,linear_train)
saturn=rbind(saturn_test,saturn_train)
moon=rbind(moon_test,moon_train)
linear$dbscan=dbscan(linear[,c("X1","X2"),with=F],eps = 0.05,minPts = 10)$cluster
ggplot(data=linear,aes(x=X1,y=X2,color=factor(dbscan),shape=factor(Target)))+geom_point()

# try with eps 0.05, 0.15 (good), 0.3.
# Notice that the outliers are quite consistent: they are the "misclassified" points
moon$dbscan=dbscan(moon[,c("X1","X2"),with=F],eps = 0.15,minPts = 50)$cluster
moon[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=dbscan]
##    dbscan     group1     group2 volume
## 1:      2 0.01152482 0.98847518   1128
## 2:      1 0.97952756 0.02047244   1270
## 3:      0 0.40863787 0.59136213    602
ggplot(data=moon,aes(x=X1,y=X2,color=factor(dbscan),shape=factor(Target)))+geom_point()

# try with eps 1, 2 (good, no outliers), 5.
saturn$dbscan=dbscan(saturn[,c("X1","X2"),with=F],eps = 2,minPts = 10)$cluster
saturn[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=dbscan]
##    dbscan group1 group2 volume
## 1:      1      0      1    284
## 2:      2      1      0    314
## 3:      0      1      0      2
ggplot(data=saturn,aes(x=X1,y=X2,color=factor(dbscan),shape=factor(Target)))+geom_point()

In the end we had to accept splitting into extra clusters to get a good classification, even though our eyes tell us there are only 2 clusters.

Application to geoclustering

load("bodyshop_geoclust.RData")
load("map_spain.RData")
claims[,c("garage_volume"):=list(.N),by="garage_id"]
location=unique(claims[garage_volume>10][,c("garage_id","garage_lon","garage_lat"),with=F])
cluster_dbscan=dbscan(location[,c("garage_lon","garage_lat"),with=F],eps = 0.3,minPts = 20,borderPoints = TRUE,search = "kdtree",splitRule = "SUGGEST")
location$cluster=factor(cluster_dbscan$cluster)

hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]

g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA) 
## Warning: Removed 2445 rows containing missing values (geom_point).

Mixture models & fuzzy clustering

A consistent worked example of (Gaussian) mixture clustering: http://www.autonlab.org/tutorials/gmm14.pdf (check slides 21-24)

application on a toy example (different from linear/moon/saturn)

x1<-rnorm(n = 100,mean = 0,sd = .4)
y1<-rnorm(n = 100,mean = 0,sd = .4)
x2<-rnorm(n = 50,mean = 1,sd = .8)
y2<-rnorm(n = 50,mean = 2,sd = .8)
x3<-rnorm(n = 200,mean = 3,sd = .4)
y3<-rnorm(n = 200,mean = 0,sd = .8)
db=data.table(x=c(x1,x2,x3),y=c(y1,y2,y3),belong=c(rep(1,100),rep(2,50),rep(3,200)))

ggplot(data=db,aes(x=x,y=y,color=factor(belong)))+geom_point()

clust=Mclust(db,G=3)
cluster=apply(clust$z,1,which.max)
db$cluster=factor(cluster)
ggplot(data=db,aes(x=x,y=y,color=factor(cluster),shape=factor(belong)))+geom_point(size=3)

# classification quality
db[,list(mean(cluster==1),mean(cluster==2),mean(cluster==3)),by=belong]
##    belong   V1   V2   V3
## 1:      1 0.97 0.03 0.00
## 2:      2 0.00 0.98 0.02
## 3:      3 0.00 0.02 0.98

Application to geoclustering

load("map_spain.RData")
cluster_Mclust=Mclust(location[,c("garage_lon","garage_lat"),with=F],control = emControl(tol=1.e-6),warn = TRUE,G=1:20)
summary(cluster_Mclust)
save(list="cluster_Mclust",file="Mclust.RData")

Assign to the main cluster & visualize

load("Mclust.RData")
cluster=cluster_Mclust$z
cluster[cluster<0.1]=0

cluster=apply(cluster,1,which.max)
location$cluster=factor(cluster)
## Warning in `[<-.data.table`(x, j = name, value = value): Supplied 4329 items to be assigned to 4029 items of
## column 'cluster' (300 unused)
hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]

load("map_spain.RData")
g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA) 
## Warning: Removed 2581 rows containing missing values (geom_point).

why fuzzy? how to use the probabilities

when clusters overlap and classification is genuinely uncertain (we lack the information to properly split the data)

x1<-rnorm(n = 100,mean = 0,sd = .5)
y1<-rnorm(n = 100,mean = 0,sd = .5)
z1<-rnorm(n = 100,mean = 0,sd = .5)
x2<-rnorm(n = 50,mean = 1,sd = 1)
y2<-rnorm(n = 50,mean = 2,sd = 1)
z2<-rnorm(n = 50,mean = 10,sd = 1)
x3<-rnorm(n = 200,mean = 3,sd = .5)
y3<-rnorm(n = 200,mean = 0,sd = 1)
z3<-rnorm(n = 200,mean = -10,sd = 1)
example=data.table(x=c(x1,x2,x3),y=c(y1,y2,y3),z=c(z1,z2,z3),belong=c(rep(1,100),rep(2,50),rep(3,200)))

ggplot(example,aes(x=x,y=y,color=factor(belong)))+geom_point()

plot3d(example[,c("x","y","z"),with=F],col = factor(example$belong))

Imagine we only observe x & y and want to produce the most consistent classification. Fuzzy clustering looks better than the previous methods for this application.

If you think of the claims-clustering or bodyshop-clustering problem, the probabilities can be used to compute a more sophisticated discrepancy measure amongst observations, based on the fuzzy clustering alone.

It can be seen as a tool for dimension reduction, like PCA.

limit: useless for complicated shapes - toy examples

linear$Mclust=factor(apply(Mclust(linear[,c("X1","X2"),with=F])$z,1,which.max))
ggplot(data=linear,aes(x=X1,y=X2,color=factor(Mclust),shape=factor(Target)))+geom_point()

moon$Mclust=factor(apply(Mclust(moon[,c("X1","X2"),with=F])$z,1,which.max))
moon[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=Mclust]
##    Mclust      group1      group2 volume
## 1:      4 0.004056795 0.995943205    493
## 2:      2 0.942460317 0.057539683    504
## 3:      3 0.033402923 0.966597077    479
## 4:      1 0.024096386 0.975903614    498
## 5:      6 0.954716981 0.045283019    530
## 6:      5 0.991935484 0.008064516    496
ggplot(data=moon,aes(x=X1,y=X2,color=factor(Mclust),shape=factor(Target)))+geom_point()

saturn$Mclust=factor(apply(Mclust(saturn[,c("X1","X2"),with=F],G=1:20)$z,1,which.max))
saturn[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=Mclust]
##     Mclust      group1    group2 volume
##  1:      1 0.000000000 1.0000000     26
##  2:      2 1.000000000 0.0000000     34
##  3:      3 1.000000000 0.0000000     48
##  4:      8 0.006711409 0.9932886    149
##  5:      5 1.000000000 0.0000000     39
##  6:      6 1.000000000 0.0000000     45
##  7:      7 0.000000000 1.0000000     55
##  8:      9 1.000000000 0.0000000     44
##  9:     10 1.000000000 0.0000000     55
## 10:     11 1.000000000 0.0000000     49
## 11:      4 0.017857143 0.9821429     56
ggplot(data=saturn,aes(x=X1,y=X2,color=factor(Mclust),shape=factor(Target)))+geom_point()

PCA for dimension reduction

theory

  • dimension (variable) reduction
  • fixes the issue of high correlation amongst variables
  • PCA is not a black-box method: there are metrics and visual tools to understand the components (eigenvectors). Rescaling the axes gives more insight when focusing on the first principal components: http://setosa.io/ev/principal-component-analysis/
  • it is a basis for visualization, as we see in the next example on hierarchical clustering:
  • one can check the shape of the clusters
  • but also identify the location of the outliers
  • other applications:
  • Partial Least Squares is a GLM built on the main PCA components
  • rotation trees & rotation forests use the first PCA component, which differs each time the variables and observations are sampled

basis for visualization - example of outlier cleaning

brands=full_expert_order_all[,list(cost_avg=mean(vehicle_sale_price,na.rm=T),cost_q.8=quantile(vehicle_sale_price,.8,na.rm=T),age_avg=mean(vehicle_age,na.rm=T),age_q.8=quantile(vehicle_age,.8,na.rm=T),volume=.N),by=vehicle_brand]
process=preProcess(brands,method=c("center","scale","BoxCox","pca"))
brands_processed=predict(process,brands)
load("claims_data_training.RData")
db=claims[,c("cost_total","cost_discount","cost_excess","cost_number_sparep","cost_pt_mat","cost_pt_lab","hcost_rep_lab","vehicle_sale_price","distance_customer_offered_garage","distance_customer_garage","garage_tac"),with=F]
res.pca <- PCA(db, quanti.sup = 1, quali.sup = 11,scale.unit=TRUE)

plot(res.pca, choix="var")

dimdesc(res.pca,axes=1:2)
## $Dim.1
## $Dim.1$quanti
##                                  correlation       p.value
## cost_pt_lab                       0.84244005  0.000000e+00
## cost_pt_mat                       0.84050798  0.000000e+00
## cost_total                        0.67955013  0.000000e+00
## cost_number_sparep                0.58722469  0.000000e+00
## cost_discount                     0.55041417  0.000000e+00
## cost_excess                       0.53824564  0.000000e+00
## hcost_rep_lab                     0.28630303  0.000000e+00
## vehicle_sale_price                0.13612663  0.000000e+00
## distance_customer_garage          0.04886999 2.149760e-136
## garage_tac                        0.03926150  1.090801e-88
## distance_customer_offered_garage  0.01998288  3.018101e-24
## 
## 
## $Dim.2
## $Dim.2$quanti
##                                   correlation       p.value
## distance_customer_garage          0.922454564  0.000000e+00
## distance_customer_offered_garage  0.921626476  0.000000e+00
## vehicle_sale_price                0.061537427 3.243281e-215
## hcost_rep_lab                     0.026788436  3.042502e-42
## cost_number_sparep                0.026192219  1.847748e-40
## cost_total                        0.021033732  1.092081e-26
## cost_excess                      -0.007740881  8.317662e-05
## cost_discount                    -0.012682340  1.139361e-10
## cost_pt_lab                      -0.049480535 8.900211e-140
## cost_pt_mat                      -0.050080300 3.847503e-143
## garage_tac                       -0.069329705 9.793773e-273
# Selection of some individuals
plot(res.pca,select="contrib 100") #plot the 100 individuals that have the highest cos2 

db_pca=data.table(res.pca$ind$coord[,1:2])
setnames(db_pca,names(db_pca),c("PC1","PC2"))

ggplot(data=sample(db_pca)[1:10000],aes(x=PC1,y=PC2))+geom_point()

deep dive into DBSCAN tuning - example of outlier cleaning

# Calibration of the parameters of DBSCAN using k_dist plot
KNN=get.knn(data = sample(db_pca)[1:10000],k=100)$nn.dist[,c(1:10*10)]
k_dist=c(apply(KNN,2,function(x)quantile(x,1:90/100)))
k_dist=data.table(x=rep(1:90,10),k_dist,k=sort(rep(1:10*10,90)))
g<-ggplot(data=k_dist,aes(x=x,y=k_dist,color=factor(k)))+geom_line()
g

# http://stackoverflow.com/questions/12893492/choosing-eps-and-minpts-for-dbscan-r


# tuning the parameters of DBSCAN to find the cluster structure we prefer
# grid=expand.grid(eps=1:10*3/100,minPts=c(50,100,500),method=c("kdtree","linear"),split=c("STD","MIDPT","FAIR","SL_MIDPT","SL_FAIR","SUGGEST"))
grid=expand.grid(eps=1:30/100,minPts=c(50),method=c("kdtree"),split=c("SUGGEST"))
extract=sample(db_pca)[1:25000]
tune=pbapply(grid,1,function(x){
  table(dbscan(x = extract,eps = x[1],minPts = x[2],search=x[3],splitRule = x[4])$cluster)
})

tune
## [[1]]
## 
##     0 
## 25000 
## 
## [[2]]
## 
##     0 
## 25000 
## 
## [[3]]
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 22149  1618   140    57    50   247    77   280    68   107    64    50    50    43 
## 
## [[4]]
## 
##     0     1     2     3     4     5     6     7     8     9 
## 16133  7773   158   130    83    78    38    37   103   467 
## 
## [[5]]
## 
##     0     1     2     3     4     5 
## 12927 11769   106    50   107    41 
## 
## [[6]]
## 
##     0     1 
## 10468 14532 
## 
## [[7]]
## 
##     0     1     2     3     4     5     6 
##  9003 15715    60    90    20    54    58 
## 
## [[8]]
## 
##     0     1     2     3 
##  7584 17315    50    51 
## 
## [[9]]
## 
##     0     1     2     3 
##  6387 18470    75    68 
## 
## [[10]]
## 
##     0     1     2 
##  5527 19422    51 
## 
## [[11]]
## 
##     0     1     2 
##  4995 19967    38 
## 
## [[12]]
## 
##     0     1 
##  4525 20475 
## 
## [[13]]
## 
##     0     1 
##  4112 20888 
## 
## [[14]]
## 
##     0     1 
##  3854 21146 
## 
## [[15]]
## 
##     0     1 
##  3556 21444 
## 
## [[16]]
## 
##     0     1     2 
##  3267 21695    38 
## 
## [[17]]
## 
##     0     1 
##  3109 21891 
## 
## [[18]]
## 
##     0     1 
##  2898 22102 
## 
## [[19]]
## 
##     0     1     2 
##  2649 22301    50 
## 
## [[20]]
## 
##     0     1 
##  2506 22494 
## 
## [[21]]
## 
##     0     1 
##  2385 22615 
## 
## [[22]]
## 
##     0     1 
##  2285 22715 
## 
## [[23]]
## 
##     0     1 
##  2187 22813 
## 
## [[24]]
## 
##     0     1     2 
##  2050 22889    61 
## 
## [[25]]
## 
##     0     1 
##  1964 23036 
## 
## [[26]]
## 
##     0     1 
##  1863 23137 
## 
## [[27]]
## 
##     0     1 
##  1784 23216 
## 
## [[28]]
## 
##     0     1 
##  1672 23328 
## 
## [[29]]
## 
##     0     1     2 
##  1535 23415    50 
## 
## [[30]]
## 
##     0     1     2 
##  1460 23486    54
# first iteration 
outliers=dbscan(extract,eps=0.5,minPts = 500)
table(outliers$cluster)
## 
##     0     1 
##  2762 22238
extract$cluster=outliers$cluster
ggplot(data=extract,aes(x=PC1,y=PC2,color=factor(cluster)))+geom_point()

# second iteration
extract_2=extract[cluster==0]
outliers=dbscan(extract_2,eps=1,minPts = 100)
table(outliers$cluster)
## 
##    0    1 
##  380 2382
extract_2$cluster=outliers$cluster
ggplot(data=extract_2,aes(x=PC1,y=PC2,color=factor(cluster)))+geom_point()

# third iteration
extract_3=extract_2[cluster==0]
outliers=dbscan(extract_3,eps=2,minPts = 50)
table(outliers$cluster)
## 
##   0   1   2 
##  57 246  77
extract_3$cluster=outliers$cluster
ggplot(data=extract_3,aes(x=PC1,y=PC2,color=factor(cluster)))+geom_point()

# now remove these 0.2% outliers.



# You can check the derived & improved method: OPTICS

H-clust & distance computation

A LITTLE THEORY:

Compute a distance between observations. http://www.analytictech.com/networks/hiclus.htm

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson’s (1967) hierarchical clustering is this:

  1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
  3. Compute distances (similarities) between the new cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

The question is: when assigning singletons to clusters, which metric should be used?
- minimum distance to the points of the cluster
- median distance to the points of the cluster
- distance to the medoid of the cluster
- maximum distance to the points of the cluster
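The minimum, median and maximum rules correspond to the "single", "median" and "complete" values of hclust's method argument in base R (the medoid rule is closer in spirit to PAM/k-medoids). A one-dimensional sketch on toy points of our own:

```r
# Five points on a line: two tight groups {1, 2} and {9, 10},
# plus an isolated point at 30.
x <- c(1, 2, 9, 10, 30)
d <- dist(x)

# single linkage  = minimum distance to the points of a cluster
# complete linkage = maximum distance to the points of a cluster
single   <- cutree(hclust(d, method = "single"),   k = 2)
complete <- cutree(hclust(d, method = "complete"), k = 2)
# with k = 2, both rules isolate the far-away point at 30
```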

APPLICATIONS

See the plots & cut the tree (hierarchy), then visualize with PCA.

Application to geoclustering

load("map_spain.RData")

cluster_hierchical=hclust(dist(location[,c("garage_lon","garage_lat"),with=F]))

plot(as.phylo(cluster_hierchical),type="fan",cex=0.5)

memb<-cutree(cluster_hierchical,k=25)

location$cluster=factor(memb)

hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]

g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA) 
## Warning: Removed 2581 rows containing missing values (geom_point).

Application to brand classification

cluster_hierchical=hclust(dist(brands_processed[,c("PC1","PC2","PC3"),with=F]))
plot(cluster_hierchical,labels =brands_processed$vehicle_brand)

memb<-cutree(cluster_hierchical,k=5)

brands_processed$hcluster=factor(memb)
brands_processed[,list(paste(vehicle_brand,collapse=",")),by=hcluster]
##    hcluster
## 1:        1
## 2:        2
## 3:        3
## 4:        4
## 5:        5
##                                                                                                                                                         V1
## 1: PEUGEOT,NISSAN,VOLKSWAGEN,OPEL,RENAULT,BMW,FIAT,FORD,CITROEN,AUDI,MERCEDES,NA,SUZUKI,SEAT,VOLVO,MITSUBISHI,CHRYSLER,ALFA ROMEO,JEEP-CHRYSLER,SAAB,ROVER
## 2:                                                                                                  MAZDA,HYUNDAI,HONDA,TOYOTA,MINI,KIA,CHEVROLET-GM,SKODA
## 3:                                                                                                               SSANGYONG,LEXUS,JAGUAR,LAND ROVER,PORSCHE
## 4:                                                                                                                                                   DACIA
## 5:                                                                                                                                                  DAEWOO

Visualization using PCA

hulls <- brands_processed[, .SD[chull(PC1,PC2)], by = hcluster]
ggplot()+geom_point(data=brands_processed,aes(x=PC1,y=PC2,color=hcluster))+geom_polygon(data = hulls, aes(x=PC1,y=PC2,fill = hcluster, group = hcluster),alpha = 0.5, color = NA)+geom_text(data=brands_processed,aes(x=PC1,y=PC2,label=vehicle_brand,color=hcluster))